Tencent's Youtu Lab, together with other institutions, has released VITA, an open-source multimodal large language model aimed at bridging the gap in Chinese-language support among such models. Built on Mixtral 8×7B, VITA expands the model's Chinese vocabulary and undergoes bilingual instruction fine-tuning, making it proficient in both English and Chinese. Key features include:

1. **Multimodal Understanding**: VITA can process video, images, text, and audio, a combination unprecedented among open-source models.
2. **Natural Interaction**: No wake word is required; VITA responds instantly while keeping its communication polite and non-intrusive.